Celestin Apprentice 5

Celestin Apprentice 5 / Apprentice-Release5.iso / Source Code / Libraries / VideoToolbox 96.06.15 / (Notes) / Fast blitting.doc

Text File | 1996-01-25 | 29KB | 731 lines

C.S.M.P. Digest             Tue, 19 Dec 95       Volume 3 : Issue 128

>From erichsen@pacificnet.net (Erichsen)
Subject: Doubles Vs BlockMove
Date: 16 Nov 1995 02:22:08 GMT
Organization: Disorganized

I did some tests (modifying the code in MoveData app from Tricks of the
Mac Game Programming Gurus) between using doubles in a loop and BlockMove
in a loop and BlockMove still blew it away (200 ticks vs 146 ticks for
BlockMove) so why don't more people use BlockMove?

I compared BlockMove vs BlockMoveData and found no difference at all (both
146 ticks). Does BlockMove not flush the cache on a 6100?

One of the replies to my previous question of why people don't just use
BlockMove instead of a copying loop was that the data is not necessarily a
block, but all the examples of blitters I've seen just copy one contiguous
block of memory to another contiguous block of memory. Why couldn't
BlockMove be used?

+++++++++++++++++++++++++++

>From cameron_esfahani@powertalk.apple.com (Cameron Esfahani)
Date: Mon, 20 Nov 1995 11:55:46 -0800
Organization: Apple Computer, Inc.

BlockMove/BlockMoveData on the first generation PPC are exactly the same
function. The reason that BlockMoveData was created in the first place
was so you could tell the system you were not moving code around and to
not flush the instruction cache. Since the 601 has a unified cache, this
means that you don't have to worry about cache-coherency. This means you
don't have to flush the processor cache.

The reason most people don't use BlockMove/BlockMoveData as a blitter is
that it will be very very slow if you ever use the screen as the
destination. The reason is that the BlockMove/BlockMoveData routines use
the PPC instruction DCBZ. This instruction will cause a data-exception
fault if the address supplied is not copy-back cacheable. The screen
isn't marked copy-back cacheable.
Hope this helps,
Cameron Esfahani

********

>From nporcino@sol.uvic.ca (Nick Porcino)
Date: 20 Nov 1995 20:35:30 GMT
Organization: Planet IX

We did some tests and found on a Q700 that BlockMoveData was faster than
BlockMove in the context of an actual game (Riddle of Master Lu)

- Nick Porcino
  Lead Engine Guy
  Sanctuary Woods

+++++++++++++++++++++++++++

>From meggs@virginia.edu (Andrew Meggs)
Date: Tue, 21 Nov 1995 02:55:08 GMT
Organization: University of Virginia

In article <erichsen-1511951722510001@pm2-3.pacificnet.net>,
erichsen@pacificnet.net (Erichsen) wrote:

> I did some tests (modifying the code in MoveData app from Tricks of the
> Mac Game Programming Gurus) between using doubles in a loop and BlockMove
> in a loop and BlockMove still blew it away (200 ticks vs 146 ticks for
> BlockMove) so why don't more people use BlockMove?
>

This got me interested, so I went and disassembled BlockMove. Surprisingly,
they aren't using doubles:

    BlockMove
      +00060 40A1C558   lwz      r5,0x0000(r3)
      +00064 40A1C55C   lwz      r6,0x0004(r3)
      +00068 40A1C560   lwz      r7,0x0008(r3)
      +0006C 40A1C564   lwz      r8,0x000C(r3)
      +00070 40A1C568   lwz      r9,0x0010(r3)
      +00074 40A1C56C   lwz      r10,0x0014(r3)
      +00078 40A1C570   lwz      r11,0x0018(r3)
      +0007C 40A1C574   lwz      r12,0x001C(r3)
      +00080 40A1C578   dcbz     0,r4
      +00084 40A1C57C   addi     r3,r3,0x0020
      +00088 40A1C580   dcbt     0,r3
      +0008C 40A1C584   stw      r5,0x0000(r4)
      +00090 40A1C588   stw      r6,0x0004(r4)
      +00094 40A1C58C   stw      r7,0x0008(r4)
      +00098 40A1C590   stw      r8,0x000C(r4)
      +0009C 40A1C594   stw      r9,0x0010(r4)
      +000A0 40A1C598   stw      r10,0x0014(r4)
      +000A4 40A1C59C   stw      r11,0x0018(r4)
      +000A8 40A1C5A0   stw      r12,0x001C(r4)
      +000AC 40A1C5A4   addi     r4,r4,0x0020
      +000B0 40A1C5A8   bdnz     BlockMove+00060

The performance win is in the dcbz/dcbt pair. I'm assuming you weren't
copying to video memory, because that's marked uncacheable, and dcbz will
severely hurt performance if your destination is uncacheable.

I probably would have written it more like this, personally. Does anyone
have any idea what makes Apple's better?
(Assuming it is...)

    ;assume source, destination, and size are all 32-byte aligned
    ;set r3 to source address minus 8 and r4 to destination address minus 8
    ;set ctr to size >> 5

    BlockMoveLoop
        lfd     fp0,8(r3)
        lfd     fp1,16(r3)
        lfd     fp2,24(r3)
        lfdu    fp3,32(r3)
        dcbz    0,r4
        dcbt    0,r3
        stfd    fp0,8(r4)
        stfd    fp1,16(r4)
        stfd    fp2,24(r4)
        stfdu   fp3,32(r4)
        bdnz    BlockMoveLoop

> I compared BlockMove vs BlockMoveData and found no difference at all (both
> 146 ticks). Does BlockMove not flush the cache on a 6100?
>

The unified instruction and data cache on the 601 wouldn't cause any
problems with treating code as data, so there's no need to maintain
coherency between the two. In other words, it shouldn't, but on the 604 it
would need to.

--
_________________________________________________________________________
andrew meggs                                the one who dies with the most
meggs@virginia.edu                          AOL free trial disks wins
_________________________________________________________________________
dead tv software --==-- the next generation of 3D games for the macintosh
       <http://darwin.clas.virginia.edu/~apm3g/deadtv/index.html>

+++++++++++++++++++++++++++

>From Mark Williams <Mark@streetly.demon.co.uk>
Date: Wed, 22 Nov 95 09:42:32 GMT
Organization: Streetly Software

In article <meggs-2011952155080001@bootp-188-82.bootp.virginia.edu>, Andrew Meggs writes:
>
> In article <erichsen-1511951722510001@pm2-3.pacificnet.net>,
> erichsen@pacificnet.net (Erichsen) wrote:
>
> > I did some tests (modifying the code in MoveData app from Tricks of the
> > Mac Game Programming Gurus) between using doubles in a loop and BlockMove
> > in a loop and BlockMove still blew it away (200 ticks vs 146 ticks for
> > BlockMove) so why don't more people use BlockMove?
> >
>
> This got me interested, so I went and disassembled BlockMove.
> Surprisingly, they aren't using doubles:
>
>     BlockMove
>       +00060 40A1C558   lwz      r5,0x0000(r3)
>       +00064 40A1C55C   lwz      r6,0x0004(r3)
>       +00068 40A1C560   lwz      r7,0x0008(r3)
>       +0006C 40A1C564   lwz      r8,0x000C(r3)
>       +00070 40A1C568   lwz      r9,0x0010(r3)
>       +00074 40A1C56C   lwz      r10,0x0014(r3)
>       +00078 40A1C570   lwz      r11,0x0018(r3)
>       +0007C 40A1C574   lwz      r12,0x001C(r3)
>       +00080 40A1C578   dcbz     0,r4
>       +00084 40A1C57C   addi     r3,r3,0x0020
>       +00088 40A1C580   dcbt     0,r3
>       +0008C 40A1C584   stw      r5,0x0000(r4)
>       +00090 40A1C588   stw      r6,0x0004(r4)
>       +00094 40A1C58C   stw      r7,0x0008(r4)
>       +00098 40A1C590   stw      r8,0x000C(r4)
>       +0009C 40A1C594   stw      r9,0x0010(r4)
>       +000A0 40A1C598   stw      r10,0x0014(r4)
>       +000A4 40A1C59C   stw      r11,0x0018(r4)
>       +000A8 40A1C5A0   stw      r12,0x001C(r4)
>       +000AC 40A1C5A4   addi     r4,r4,0x0020
>       +000B0 40A1C5A8   bdnz     BlockMove+00060
>
>
> The performance win is in the dcbz/dcbt pair. I'm assuming you weren't
> copying to video memory, because that's marked uncacheable, and dcbz will
> severely hurt performance if your destination is uncacheable.
>
> I probably would have written it more like this, personally. Does anyone
> have any idea what makes Apple's better? (Assuming it is...)

Consecutive stfd's stall both pipelines. This means that (assuming all
cache hits) you get one fp store every 3 cycles, compared with one integer
store every cycle. The result is 12 cycles to transfer 4 words using fp
registers, but only 10 cycles using integer registers. (see page I-175 of
the 601 User manual).

>     ;assume source, destination, and size are all 32-byte aligned
>     ;set r3 to source address minus 8 and r4 to destination address minus 8
>     ;set ctr to size >> 5
>
>     BlockMoveLoop
>         lfd     fp0,8(r3)
>         lfd     fp1,16(r3)
>         lfd     fp2,24(r3)
>         lfdu    fp3,32(r3)
>         dcbz    0,r4
>         dcbt    0,r3
>         stfd    fp0,8(r4)
>         stfd    fp1,16(r4)
>         stfd    fp2,24(r4)
>         stfdu   fp3,32(r4)
>         bdnz    BlockMoveLoop
>

One other problem with your code (and presumably why Apple use the
apparently wasteful addi instructions rather than load/store with update)
is that your dcbt instruction comes too late... fp3 already contains the
double at r3 by the time you hit the dcbt 0,r3 instruction, so it has no
effect. Much worse, the dcbz always touches the block you wrote the
_previous_ time through the loop... this could easily be fixed by
preloading r5 with 8 and writing

        dcbz    r5,r4
        dcbt    r5,r3

But you would still lose out on a 601. I _think_ it would be quicker on a
604, but I've not checked.

- --------------------------------------
Mark Williams<Mark@streetly.demon.co.uk>

+++++++++++++++++++++++++++

>From cameron_esfahani@powertalk.apple.com (Cameron Esfahani)
Date: Tue, 28 Nov 1995 01:24:06 -0800
Organization: Apple Computer, Inc.

BlockMoveData was introduced with System 7.5. The code for it was kicking
around Apple for a little while before we had a shipping vehicle for it.

Cameron Esfahani

+++++++++++++++++++++++++++

>From deirdre@deeny.mv.com (Deirdre)
Date: Tue, 28 Nov 1995 14:46:04 GMT
Organization: Tarla's Secret Clench

BlockMove was available in System 1.0. However, the distinction between
BlockMove and the newer call BlockMoveData is only significant on 040s and
higher. On other machines it is the same trap.

_Deirdre

+++++++++++++++++++++++++++

>From kenp@nmrfam.wisc.edu (Ken Prehoda)
Date: Wed, 29 Nov 1995 09:26:05 -0600
Organization: Univ of Wisconsin-Madison, Dept of Biochemistry

As far as I can tell BlockMoveData is _only_ significant on the 040.
BlockMove does not flush the cache on the PPC's.
_____________________________________________________________________________
Ken Prehoda                                          kenp@nmrfam.wisc.edu
Department of Biochemistry                     http://www.nmrfam.wisc.edu
University of Wisconsin-Madison                         Tel: 608-263-9498
420 Henry Mall                                          Fax: 608-262-3453

+++++++++++++++++++++++++++

>From cameron_esfahani@powertalk.apple.com (Cameron Esfahani)
Date: Wed, 29 Nov 1995 22:53:41 -0800
Organization: Apple Computer, Inc.

> As far as I can tell BlockMoveData is _only_ significant on the 040.
> BlockMove does not flush the cache on the PPC's.

That is not true. BlockMove does flush the cache on the new PPCs. Any PPC
with a split cache (603/604 and any other ones) will require cache
flushing. So, BlockMove on a 601-based machine doesn't flush the cache
because it makes no sense, but on > 601-machines, it does flush.

Cameron Esfahani

+++++++++++++++++++++++++++

>From mick@emf.net (Mick Foley)
Date: Wed, 29 Nov 1995 22:23:29 -0800
Organization: "emf.net" Quality Internet Access. (510) 704-2929 (Voice)

> As far as I can tell BlockMoveData is _only_ significant on the 040.
> BlockMove does not flush the cache on the PPC's.

Not on the 601 which has a unified cache. But it should make a big
difference on the 603 and 604 which have split data and code caches.

Mick

+++++++++++++++++++++++++++

>From Ed Wynne <arwyn@engin.umich.edu>
Date: 4 Dec 1995 04:09:26 GMT
Organization: Arwyn, Inc.

Actually, that's almost right... BlockMoveData CAN cause cache flushing on
601-based machines if they are running the DR emulator. The processor
cache doesn't get flushed but the emulator's internal cache of recompiled
code does. This process is probably a fair amount slower than the real
on-chip cache flush since it is a software based operation.

To my knowledge the only machines so far with this configuration would be
the 7200 and 7500. (does the 8500 have a 601 option?)

-ed

---------------------------

C.S.M.P. Digest             Tue, 26 Dec 95       Volume 3 : Issue 129

---------------------------

>From steele@isi.edu (Craig S. Steele)
Subject: Block copy on 604 slow
Date: Tue, 5 Dec 1995 18:30:53 -0800
Organization: USC Information Sciences Institute

I'm trying to benchmark block copy rates of various sizes for PowerPCs. My
results are disappointing for the 604, and cause me to wonder what it is I
don't understand. Testing on a 9500/120, to which I have limited access, gives
the following results for copy code using 32-bit integer and 64-bit double
load and stores, respectively:

Asm lvector copy of 1024 doubles in 44.8 nS/acc, 5.4 clocks/acc, 85.1 MB/s
Asm dvector copy of 1024 doubles in 34.1 nS/acc, 4.1 clocks/acc, 111.9 MB/s

The source array is aligned to 4K, the destination array to 4K+0x100, to avoid
possible aliasing interlocks. The source array is preloaded immediately
before the copy routine is called, so I would expect everything to run at L1
cache rates.

I would naively expect the copy code to average about 1.5 clocks per load or
store. Instead, my code reports over 4 clocks/access. The code uses the
time-base register for timing, which shouldn't cause significant cache
disturbance.

Can anyone contradict, corroborate, or explain my poor results? If I can't
do better than this, we'll have to build extra hardware :-(

Thanks in advance.

-Craig

        exportf2 dvec_copy
        mtctr   r5              ; init loop counter
        addi    r3,r3,-8        ; predecrement pointer by double size
        addi    r4,r4,-8        ; predecrement pointer by double size
        li      r6,8            ; cache line alignment constant for dcbz
        b       dvc_1
        align   6
dvc_1
        dcbz    r6,r3           ; kill dest. cache line
        lfd     fp0,8(r4)
        lfd     fp1,16(r4)
        lfd     fp2,24(r4)
        lfdu    fp3,32(r4)
        stfd    fp0,8(r3)
        stfd    fp1,16(r3)
        stfd    fp2,24(r3)
        stfdu   fp3,32(r3)
        bdnz    dvc_1           ; test loop condition
        blr

Craig S. Steele - Not yet Institutionalized

+++++++++++++++++++++++++++

>From rbarris@netcom.com (Robert Barris)
Date: Wed, 6 Dec 1995 09:46:47 GMT
Organization: NETCOM On-line Communication Services (408 261-4700 guest)

In article <9512051830.AA53505@kandor.isi.edu>,
Craig S. Steele <steele@isi.edu> wrote:
>I'm trying to benchmark block copy rates of various sizes for PowerPCs. My
>results are disappointing for the 604, and cause me to wonder what it is I
>don't understand. Testing on a 9500/120, to which I have limited access, gives
>the following results for copy code using 32-bit integer and 64-bit double
>load and stores, respectively:
>
>Asm lvector copy of 1024 doubles in 44.8 nS/acc, 5.4 clocks/acc, 85.1 MB/s
>Asm dvector copy of 1024 doubles in 34.1 nS/acc, 4.1 clocks/acc, 111.9 MB/s

OK, in regular "bytes", you appear to be moving (for example's sake)
8192 bytes from address (say) 0x1000000 to address (say) 0x1002100.

So you are reading 8K and writing 8K as I read it... in a perfect
world all of your data would fit (precisely) into the L1 d cache.

>The source array is aligned to 4K, the destination array to 4K+0x100, to avoid
>possible aliasing interlocks. The source array is preloaded immediately
>before the copy routine is called, so I would expect everything to run at L1
>cache rates.

Except that you are sharing that L1 with things like interrupt tasks,
68K interrupt tasks (which invoke the emulator causing additional
pollution), and so on.

Since as far as I know, there is no way to completely shut off
PowerPC interrupts, quantifying the effect of background processes on
your cache population can be a bit tricky.

>I would naively expect the copy code to average about 1.5 clocks per load or
>store. Instead, my code reports over 4 clocks/access. The code uses the
>time-base register for timing, which shouldn't cause significant cache
>disturbance.
When you say per access, do you mean per double "moved" as in a read and
a write, or per double accessed, as in the read or the write alone?

I guess I can work it out: 110MB/s (say it's 120 for argument's sake) is
about 1MB per million clocks (at 120MHz). Or about a byte moved per
clock, or a double moved per 8 clocks. OK so that's 4 per double
read, 4 per double write (on average).

Suggestions:
 1. Plot speed versus vector length. Look for nonlinearities.
    (deliberately shrink or grow the vector).
 2. wiggle that 256 byte offset factor some more. or make it zero.
    I do not think the 4-wayness would become a problem until you went
    above 8K vectors, then very little would help...
 3. think about cache hinting at or near the bottom of the loop.
    if for some reason a cache line which you are going to read from
    has been dropped, it's good to schedule its re-fetch as far ahead
    as possible.
    I'm sure Tim Olson can elaborate much more better good :)
 4. I hear Exponential Technology has a faster BiCMOS 604 coming...

Rob Barris
Quicksilver Software Inc.
rbarris@quicksilver.com
* opinions expressed not necessarily those of my employer *

+++++++++++++++++++++++++++

>From steele@isi.edu (Craig S. Steele)
Date: Wed, 6 Dec 1995 12:41:15 -0800
Organization: USC Information Sciences Institute

In article <rbarrisDJ5sHz.MJy@netcom.com>, rbarris@netcom.com (Robert Barris) writes:
> In article <9512051830.AA53505@kandor.isi.edu>, Craig S. Steele
> <steele@isi.edu> wrote:
> >I'm trying to benchmark block copy rates of various sizes for
> >PowerPCs. My results are disappointing for the 604, and cause me
> >to wonder what it is I don't understand.
> >Testing on a 9500/120, to
> >which I have limited access, gives the following results for copy
> >code using 32-bit integer and 64-bit double load and stores,
> >respectively:
> >
> >Asm lvector copy of 1024 doubles in 44.8 nS/acc, 5.4 clocks/acc, 85.1 MB/s
> >Asm dvector copy of 1024 doubles in 34.1 nS/acc, 4.1 clocks/acc, 111.9 MB/s

> So you are reading 8K and writing 8K as I read it... in a perfect
> world all of your data would fit (precisely) into the L1 d cache.

Exactly. However, I did benchmark a range of copy sizes from 512B to 1MB;
the quoted 8KB block results were the fastest. Needless to say the rate
for larger blocks dropped precipitously as the sizes busted (burst?) the
L1 and L2 caches.

> >...so I would expect everything to run at L1 cache rates.

> Except that you are sharing that L1 with things like interrupt
> tasks, 68K interrupt tasks (which invoke the emulator causing
> additional pollution), and so on.

True. I would have thought that at least some of my trials would have fit
in between interrupts, e.g., the critical part of the 8KB case only takes
about 100uS, and the smaller proportionately less. I also tried
back-to-back copy calls, producing essentially identical results. I did
get _much_ worse results when I experimented with using the MacOS
Microseconds call for timing, so the cache pollution issue is very real.
What is the highest rate interrupt source on an idle PowerMac anyway? Is
Microseconds non-native? I'm clueless.

> Since as far as I know, there is no way to completely shut off
> PowerPC interrupts, quantifying the effect of background processes on
> your cache population can be a bit tricky.

I believe I know how to do it on an 8100 (although not the 9500) so it's
probably worth a (probable deathcookies) experiment to see if it makes a
difference there. I deeply regret having blown up our only hardware
prototype last month... Maybe next week I'll have a bare machine again,
knock on Formica(TM).

> >I would naively expect the copy code to average about 1.5 clocks per
> >load or store. Instead, my code reports over 4 clocks/access.

> I guess I can work it out ... OK so that's 4 per double
> read, 4 per double write (on average).

Yes.

> Suggestions:
> 1. Plot speed versus vector length. Look for nonlinearities.
> (deliberately shrink or grow the vector).

For a 9500/120:

     Size     Rate
     512B     49 MB/s
      1KB     68
      2KB     87
      4KB    109
      8KB    112
     16KB     68
     32KB     62
     64KB     54
    128KB     53
    256KB     40
    512KB     35
   1024KB     32

The trends are reasonable, it's just the L1 peak rate that seems very low
to me. The 6100 and 8100, on the other hand, have some huge anomalous
dips for 128KB operations, presumably managing to evict the code from both
the L1 & L2 unified caches in some particularly malign way.

> 2. wiggle that 256 byte offset factor some more. or make it zero.

Zero makes things about 10% slower, but I haven't yet tried other offsets.

> 3. think about cache hinting at or near the bottom of the loop.
> if for some reason a cache line which you are going to read from
> has been dropped, it's good to schedule its re-fetch as far ahead
> as possible.

A prior load loop is supposed to have ensured that the source is in the
cache, but this is a good suggestion to double check that assumption, and
probably the right thing to do for a general-purpose copy where cache
status is uncontrolled. I'll check this out.

> 4. I hear Exponential Technology has a faster BiCMOS 604 coming...

That certainly does look interesting, "only" $14million capitalization,
but good credentials. Unfortunately, I have to put something under the
tree for this Christmas, can't wait for that rosy glow ("Is it Rudolph or
is it bipolar?") we might see next. :-)

Craig S. Steele - Not yet Institutionalized

+++++++++++++++++++++++++++

>From tim@apple.com (Tim Olson)
Date: 7 Dec 1995 03:33:26 GMT
Organization: Apple Computer, Inc. / Somerset

In article <9512051830.AA53505@kandor.isi.edu> steele@isi.edu (Craig S.
Steele) writes:

> I would naively expect the copy code to average about 1.5 clocks per load or
> store. Instead, my code reports over 4 clocks/access. The code uses the
> time-base register for timing, which shouldn't cause significant cache
> disturbance.
>
> Can anyone contradict, corroborate, or explain my poor results?

I did a number of measurements awhile back which showed that a 604 can
perform the loop you gave (without the DCBZ) at about 1.3 cycles per
doubleword loaded or stored -- this was done by measuring the runtime
of copying a 64-byte block over many iterations, so both source and
destination were in the cache.

The DCBZ instruction spends multiple cycles clearing the allocated
cache block, so that will add some overhead (I don't have my spec with
me -- I seem to remember it is 4 cycles), which should bring it to
somewhere around 15 cycles per loop iteration, or about 1.8 cycles per
doubleword, which is still far less than your reported 4 cycles.

First, try running without the DCBZ to see if it more closely matches
my results (~1.3 cycles per doubleword); if not, then you might be
forgetting about some multiplication factor when using the timebase
register. On the 604, it increments every 4th bus clock.

--
Tim Olson
Apple Computer, Inc. / Somerset
tim@apple.com

+++++++++++++++++++++++++++

>From cliffc@ami.sps.mot.com (Cliff Click)
Date: 7 Dec 95 09:23:08
Organization: none

steele@isi.edu (Craig S. Steele) writes:

  Craig S. Steele <steele@isi.edu> wrote:
  >I'm trying to benchmark block copy rates of various sizes for
  >PowerPCs. My results are disappointing for the 604, and cause me
  >to wonder what it is I don't understand. Testing on a 9500/120, to
  >which I have limited access, gives the following results for copy
  >code using 32-bit integer and 64-bit double load and stores,
  >respectively:

Have you tried using "lmw" and "stmw" instead of "lfd" and "stfd"?
My 604 book sez these are #regs+2 cycles each, whilst the float
operations are 3 cycles each.
For large enough blocks, you should
win on the lmw and stmw.

Cliff
--
Cliff Click                  Compiler Researcher & Designer
RISC Software, Motorola      PowerPC Compilers
cliffc@risc.sps.mot.com      (512) 891-7240

+++++++++++++++++++++++++++

>From tim@apple.com (Tim Olson)
Date: 8 Dec 1995 02:59:57 GMT
Organization: Apple Computer, Inc. / Somerset

In article <CLIFFC.95Dec7092308@ami.sps.mot.com>
cliffc@ami.sps.mot.com (Cliff Click) writes:

> Have you tried using "lmw" and "stmw" instead of "lfd" and "stfd"?
> My 604 book sez these are #regs+2 cycles each, whilst the float
> operations are 3 cycles each. For large enough blocks, you should
> win on the lmw and stmw.

The lfd instruction has a 3-cycle latency for using the result of the
load in a floating-point operation, but the issue-rate of lfd is one
per cycle. When pipelined in the manner used in the block copy code,
it can transfer at close to one doubleword per cycle.

Load and store multiple instructions can achieve close to one word per
cycle for large transfers, but that is half the bandwidth of the
lfd/stfd solution.

--
Tim Olson
Apple Computer, Inc. / Somerset
tim@apple.com

+++++++++++++++++++++++++++

>From Mark Williams <Mark@streetly.demon.co.uk>
Date: Thu, 07 Dec 95 18:25:26 GMT
Organization: Streetly Software

In article <CLIFFC.95Dec7092308@ami.sps.mot.com>, Cliff Click writes:
>
> steele@isi.edu (Craig S. Steele) writes:
>
> Craig S. Steele <steele@isi.edu> wrote:
> >I'm trying to benchmark block copy rates of various sizes for
> >PowerPCs. My results are disappointing for the 604, and cause me
> >to wonder what it is I don't understand. Testing on a 9500/120, to
> >which I have limited access, gives the following results for copy
> >code using 32-bit integer and 64-bit double load and stores,
> >respectively:
>
> Have you tried using "lmw" and "stmw" instead of "lfd" and "stfd"?
> My 604 book sez these are #regs+2 cycles each, whilst the float
> operations are 3 cycles each.
> For large enough blocks, you should
> win on the lmw and stmw.
>
> Cliff
> --
> Cliff Click                  Compiler Researcher & Designer
> RISC Software, Motorola      PowerPC Compilers
> cliffc@risc.sps.mot.com      (512) 891-7240

But surely the point is that lfd & stfd have a _latency_ of 3 cycles, but
a throughput of 1 instruction per cycle, whereas the lmw/stmw have both a
latency and throughput of 1 instruction per #regs+2 cycles.

That means the lfd/stfd method should be able to move (ie load and store)
1 word per cycle, while the lmw/stmw cannot do better than 1 word every 2
cycles (and even with 28 regs available it would take 60 cycles to move 28
words).

- --------------------------------------
Mark Williams<Mark@streetly.demon.co.uk>

+++++++++++++++++++++++++++

>From tjrob@bluebird.flw.att.com (Tom Roberts)
Date: Sat, 9 Dec 1995 19:19:22 GMT
Organization: AT&T Bell Laboratories

In article <4a89nd$hrp@cerberus.ibmoto.com>, Tim Olson <tim@apple.com> wrote:
>In article <CLIFFC.95Dec7092308@ami.sps.mot.com>
>cliffc@ami.sps.mot.com (Cliff Click) writes:
>
>> Have you tried using "lmw" and "stmw" instead of "lfd" and "stfd"?
>> My 604 book sez these are #regs+2 cycles each, whilst the float
>> operations are 3 cycles each. For large enough blocks, you should
>> win on the lmw and stmw.
>
>The lfd instruction has a 3-cycle latency for using the result of the
>load in a floating-point operation, but the issue-rate of lfd is one
>per cycle. When pipelined in the manner used in the block copy code,
>it can transfer at close to one doubleword per cycle.
>
>Load and store multiple instructions can achieve close to one word per
>cycle for large transfers, but that is half the bandwidth of the
>lfd/stfd solution.

In practical systems, memory bandwidth is MUCH more important than the
number of instructions used or their throughput or latency. (This assumes
that the data actually resides in memory, not just in the cache. This also
assumes a "long" loop, so the code is in the icache.)
In systems which run the 604 at 1:1 clocking (i.e. internal CPU clock
equals external bus clock), memory bandwidth can be 2-4 times slower than
simple calculations suggest. This is due to cache-access limitations and
the fact that both the CPU and the bus access unit are competing for
access to the cache. In this mode the memory essentially NEVER overlaps
address and data tenures on the bus (halving memory bandwidth); there are
usually several bus clocks between successive cycles, reducing bandwidth
even more.

With 1.5:1 clocking this effect is reduced -- the cache can handle one
access per internal clock, so there is a cycle available to the CPU
between every 2 bus accesses. At 2:1 this effect should disappear, as the
CPU can get every other cycle, and keep up with the memory bus bandwidth.

Note that only recently have 604 chips been shipping which can go 1.5:1 at
66 MHz bus clock.

Tom Roberts
tjrob@iexist.att.com

---------------------------